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Abstract 

It has become increasingly clear that the current taxonomy of clinical phenotypes is mixed with molecular heterogeneity, 
which potentially affects the treatment effect for involved patients. Defining the hidden molecular-distinct diseases using 
modern large-scale genomic approaches is therefore useful for refining clinical practice and improving intervention 
strategies. Given that microRNA expression profiling has provided a powerful way to dissect hidden genetic heterogeneity 
for complex diseases, the aim of the study was to develop a bioinformatics approach that identifies microRNA features 
leading to the hidden subtyping of complex clinical phenotypes. The basic strategy of the proposed method was to identify 
optimal miRNA clusters by iteratively partitioning the sample and feature space using the two-ways super-paramagnetic 
clustering technique. We evaluated the obtained optimal miRNA cluster by determining the consistency of co-expression 
and the chromosome location among the within-cluster microRNAs, and concluded that the optimal miRNA cluster could 
lead to a natural partition of disease samples. We applied the proposed method to a publicly available microarray dataset of 
breast cancer patients that have notoriously heterogeneous phenotypes. We obtained a feature subset of 13 microRNAs 
that could classify the 71 breast cancer patients into five subtypes with significantly different five-year overall survival rates 
(45%, 82.4%, 70.6%, 100% and 60% respectively; p = 0.008). By building a multivariate Cox proportional-hazards prediction 
model for the feature subset, we identified has-miR-146b as one of the most significant predictor (p = 0.045; hazard 
ratios = 0.39). The proposed algorithm is a promising computational strategy for dissecting hidden genetic heterogeneity 
for complex diseases, and will be of value for improving cancer diagnosis and treatment. 
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Background 

The patients with similar prognosis often respond differently to 
the same treatment. One explanation for this is that disease with 
similar phenotypes may have different genetic causes, a phenom- 
ena of genetic heterogeneity. Such genetic heterogeneity presents a 
significant challenge for modern clinical practice and biomedical 
research on common human diseases. MicroRNAs are a class of 
small noncoding RNAs that regulate translation of protein coding 
mRNAs by translational inhibition or cleavage of the mRNA 
transcripts. In humans, miRNAs are approximately 22-nucleotide- 
long single-stranded RNAs[l]. They play a key role in the post- 
transcriptional regulation of up to 30% of protein-coding genes, 
and have a profound impact on many cellular processes during 
development and adult life. 

MicroRNA(MiRNA) expression profiles of tumour samples 
have recently been shown to provide phenotypic signatures for 
specific types of cancer [2-6], making it potentially useful in 
tackling the heterogeneity issues for complex human diseases. 
Recent studies have used DNA microarrays to study breast cancer, 
and have shown that it was possible to identify subgroups of 



patients in terms of different survival courses by gene expression 
profiles [7]. Blenkiron et al. [8] analysed the expression of miRNAs 
in breast tumour samples using a bead-based flow-cytometric 
profiling method. To our knowledge, this was the first integrated 
analysis of miRNA expression, mRNA expression and genomic 
changes in breast cancer. They showed that miRNAs could act as 
molecular signature to distinguish the subtypes of breast cancer, 
which would be unlikely to be discovered by traditional clinical 
approaches. They further identified distinctive microRNA signa- 
tures that correlated with cytogenetic and molecular subtypes of 
breast cancer. However, most methods that aimed to identify 
clinically relevant subtypes using microRNAs usually employed 
unsupervised learning techniques, such as hierarchical clustering, 
which would be of limited use when the disease heterogeneity 
results from only a small subset of the miRNAs participating in a 
particular cellular process. In these cases, the "signal" or relevant 
miRNAs may be overwhelmed by the "noise" generated by the 
vast majority of unrelated data. 

In this study, we aimed to identify a subset of miRNAs that 
could dissect breast cancer patients with different survival 
outcomes. We employed a coupled two-way clustering algorithm 
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(CTWC) [9-1 1] to iteratively partition breast cancer samples and 
microRNA sets. The partition was done by a super-paramagnetic 
clustering (SPC) algorithm [12] that can automatically determine 
the number of partitions, and generate robust stable phenotypic 
partitions and the significant feature signatures (miRNA subsets). 
For identifying the optimal miRNA cluster, we developed a 
functional consistency index to evaluate the functional consistency 
of the significant miRNA subsets. The index considers not only 
expression correlation but also chromosome distance between 
miRNAs within a significant miRNA subset. The index was 
developed based on the following considerations. With this 
measurement, the optimal miRNA cluster is expected to not only 
dissect breast cancer samples efficiendy, but also function in a 
coherent manner. Finally we applied the proposed method to a 
publicly available breast cancer microRNA expression profiling 
dataset, and then employed the Kaplan-Meier survival analysis 
[13] and multivariate Cox proportional-hazards prediction mod- 
elling [14] to determine the differential survival outcomes of new 
subtypes. 

Materials and Methods 

Dataset Preprocessing 

A miRNA expression data (GSE7842) was download from the 
Gene Expression Omnibus (GEO)(http://www.ncbi.nlm.nih.gov/ 
geo/). The raw dataset consisted of the expression profiles of 309 
miRNAs in 137 samples in which 1 19 passed quality control (93 
primary human breast tumours, 21 cell lines and five normal 
breast samples). The original breast cancer study used a single 
sample predictor (SSP) to assign individual samples to one of five 
breast tumour subtypes: Luminal A(ER+, PR+, HER2— ), 
Luminal B(ER+, PR+, HER2+), Basal-like(ER-, PR-, HER2- 
), HER2+(ER— , PR-, HER2+) and Normal breast-like [8]. For 
heterogeneity analysis, we used the subset of the data that consists 
of the expression profiles of 1 33 miRNAs, which express in 7 1 
human breast cancers with specific clinical information (Meno- 
pause stage, Size, Grade, Total nodes, NPI (Nottingham 
Prognostic Index), Survival time, Dead, DFI and Age etc.) for 
subsequent analysis. 

Coupled two-way clustering 

The applied CTWC algorithm is a heuristic and iterative 
method, and is implemented by a stand-alone package. Super- 
Paramagnetic Clustering was used as the underlying clustering tool 
to partition the whole dataset into subsets of miRNA and samples 
iteratively until significant partitions (submatrices) are obtained. A 
detailed description of the SPC and CTWC algorithm can be 
found in [15]. For a microRNA expression profile matrix M, we 
denoted the initial sample set as Si , and the microRNA set mG\ . 
Clustering microRNA set mGi on the basis of their expression 
levels over the set of samples Sj was referred to the process in an 
operation denoted by mGi(Sj). Similarly defined, Sj(mGj) 
described the process of clustering Sj using all microRNA of 
mGi. We employed CTWC for identifying significant miRNA 
subsets in the breast cancer dataset. Specifically, first, we clustered 
all samples using all miRNAs to identify stable sample partitions, 
and clustered all miRNAs using all samples to identify stable 
miRNA subsets. Then, we clustered the miRNA gained in the 
previous step using the newly defined sample partitions (including 
all samples) to find the responsible miRNA subsets of high 
discriminating power. Finally, we clustered each sample partition 
again using each miRNA subset with high discriminating power. 
In the searching process, we explored the cluster depth for both 
dimensions of samples and miRNAs. The cluster depth selected 



was based on the empirical judgement on whether the clinical 
samples could be well separated using the candidate miRNA 

subset(s). 

Evaluation of a miRNA subset using the functional 
consistency score 

We obtained many high-correlation sample partitions and 
miRNA subsets by CTWC. To evaluate the power of dissecting 
tumour subtypes, we developed a functional consistency score that 
evaluates the biological significance of a miRNA subset. The steps 
for computing a functional consistency score of miRNA subsets 
mGi(i = 1,2 • • ■ ,JV)were as follows: 

(1) Correlation coefficient (CC). For each mGi, we 
computed the correlation coefficient 



m C(m Gi) = mean CC(mj ,m/) 

7=1,7=1 

where CC(mj,mi)h the Pearson correlation coefficient between 
microRNA mj and mi in mGi, and T is the number of microRNA 
in mGi. 

(2) Computing the chromosome cluster. Based on the 
chromosome location of each miRNA in human via MiRGen 
[16], we grouped human miRNAs into different clusters by 
requiring that in each cluster the maximum distance between any 
two miRNAs be less than 50K bp, and obtained 5 1 intergenic or 
gene-resident spatial clusters in which 38 overlap with the 133 
miRNAs in this study. 

For each pa.irP(mj,mi),(mjemGi,miemGi), we map the pair 
P(mj,mi) to 51 chromosome clusters. Then, we computed the 
number of the pair P(mj,mi) that overlaps the 51 chromosome 
clusters N(mj,mi). The chromosome consistency score of mGi is 
then defined as: 



CCS(mGi) = 



Tj =u= xP{mj,m,) 



where T is the total number of microRNA in rnGj. 
(3) Functional consistency. For eachwG;, 

FC(mGi) = mC{mGi) + CCS(mGi),{i = 1 ,2, • • • ,N) 



A higher FC(mGi) corresponds to a higher degree of functional 
consistency among the miRNAs involved in mGi. 

Survival analysis 

To verify the clinical significance of the identified hidden breast 
cancer subtypes, we plotted their survival curves by Kaplan-Meier 
product-limit method, and assessed the differences between the 
survival curves of breast cancer patients belonging to different 
subtypes by log-rank test. The multivariate Cox proportion- 
hazards model was used to predict the overall survival time, and to 
determine the significance (at significant level p<0.05) of the 
effects if the miRNAs is included in the identified miRNA subset(s) 
on the patient' survival months. Wald Chi-square test was used to 
determine the significance of each predictor's hazard toward the 
survival time. 
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Predicting the targets of optimized miRNA subset 

To further understand the function and dissect the involved 
pathways in the optimized miRNAs, we predicted the targets of 
the optimized miRNA subset mGn, using mirGen [16] which 
includes miRanda, PicTar and TargetScanS prediction programs. 
This is based on Sethupathy et at study [17], which found that the 
intersection of the predictions from prediction program could 
achieve both high sensitivity and specificity. In addition, we also 
considered the expression of predicted target genes of miRNAs. 
Those genes with anti-correlated expression of the miRNAs were 
filted and mapped to KEGG in order to study the underlining 
related pathways. 

Computational algorithms 

The flowchart of the proposed method was graphically depicted 
by Figure 1. The programming codes for computing a function 
consistency score are available upon request to the authors. The 
hierarchical dendrogram resulted from the coupled two-way 
clustering was plotted by Treeview[18,19]. 

Results 

Selection of miRNA subset that partitions breast cancer 
samples with highest functional consistency score 

We employed CTWC algorithm to search for significant 
miRNA subsets that could distinguish breast cancer subtypes. 
CTWC identified many highly correlated miRNA subsets during 
the recursive partitioning of samples and miRNAs. In this study we 
aimed to identify the partition of breast cancers [8]. With CTWC, 
we identified 75 stable and significant miRNA subsets at the 
cluster depth of one (mG2,mGi, ■ ■ ■ ^mG^) (Table 1). Then, we 
determined the function consistency scores of the 75 stable 
miRNA subsets separately. 

We computed and sorted the correlation coefficient of 75 
miRNA subsets. The CC (the average Pearson correlation 
coefficient across all pairs of miRNAs in the cluster, see Methods 
for details) of the top 5 subsets (mGn,mGs,mG2,mGe,mG5)2ive: 
0.565230769, 0.55772, 0.50603, 0.48114, 0.47562, respectively. 
wiG4has the highest CC. The CC of pairwise miRNAs in mG^is 
shown in Table SI. 

For 5 microRNA subsets with higher CC, we mapped all 
combinations of microRNAs pairs in each microRNA subset into 
51 different chromosome clusters based on MiRGen, and then 
computed a chromosome consistency score for each miRNA 
subset. We found that score of mG4was 2/78 = 0.0256. The other 
4 microRNA subsets have no miRNAs pairs belonging to any 
chromosome cluster. Based on the correlation coefficient and 
chromosome cluster of miRNA subsets, we found the has mG4the 
highest functional consistency score with 0.5908. 

Among the 13 miRNAs in mG 4 , miR-22 1/222 negatively 
regulates estrogen receptor alpha, and is associated with tamoxifen 
resistance in breast cancer [20] . The expression of miR-206 is 
down-regulated in estrogen receptor alpha-positive human breast 
cancer [21]. Recent data [22] indicated that the pattern of 
expression of miR- 146a and miR- 1 46b was similar, suggesting that 
their target genes might be coregulated, although they may be 
located on different chromosomes. Furthermore, the expression of 
miR-146a/b is high in those samples that have been classified as 
Basal-like. In the Basal-like cell lines with the highest miR-146a/b 
expression level, the amount of BRCA1 was particularly low. 
Further analysis revealed that the expression levels of miR- 146b 
was significandy elevated only compared to Luminal B and Basal- 
like subtype. In addition, miR- 143/ 145 microRNAs were 
repressed in Basal-like compared to Luminal subtype [23]. Lu et 



Iteratively peel off sample and 
miRNA subsets using the SPC- 
based two-way clustering algorithm 

I 

Compute the correlation coefficient of 
pairwise of miRNAs in each miRNA 
subset co-expression level 

* 

Map the pairwise miRNAs in each 
miRNA subset based on the 
chromosome clusters 

I 

Evaluate candidate miRNA subsets 
based on correlation coefficient and 
chromosome cluster to extract the 
highest scored subsets 

I 

Validate the newly identified 
subtypes based on their ability in 
predicting survival profiles of 
patients, and then identify the most 
important features via the 
multivariate Cox proportional- 
hazards model 



Figure 1. The graphic algorithm flow for the proposed SPC- 
based two-way clustering. 

doi:10.1371/journal.pone.0087601.g001 

al [24] demonstrated that the expression level of miR- 155 was 
inversely correlated with estrogen receptor. Our results are 
consistent with the recent discoveries. 

Clustering breast cancer samples using the selected 
miRNA subsets 

We applied mGn, using SPC to cluster 71 breast samples from 
Blenkiron et al. using SPC [8] . Here, the Euclidean distance and 
Pearson's correlation coefficient were used as the sample and the 
miRNA expression similarity measures, respectively. We success- 
fully partitioned the 71 samples into five subtypes (Figure 2). 

To verify the clinical significance of the identified hidden breast 
cancer subtypes, we plotted the survival curves by Kaplan-Meier 
product-limit method, and assessed the differences between the 
survival curves of breast cancer patients with different subtypes by 
a log-rank test (Figure 3). The 5 year survival rates for five subtypes 
were 45%, 82.4%, 70.6%, 100% and 60% (£ = 0.008), respective- 
ly- 

In order to develop a compact model for clinical use, we further 
identified miRNAs that contributed mostly to the high prediction 
power. Multivariate Cox proportional-hazards model was used to 
analyze miRNAs in mG\. To reduce the number of variables to be 
modeled, we applied the stepwise variable selection option (with 
the same inclusion and exclusion p value of 0.05) for the 
multivariate Cox proportional-hazards regression model. We 
found one predictor (miRNAs) in mGn, (Table 2). miR- 146b was 



PLOS ONE | www.plosone.org 



3 



January 2014 | Volume 9 | Issue 1 | e87601 



Unraveling the Heterogeneities of Breast Cancer 



Table 1. miRNA clusters using CTWC 
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selected to be the significant prognostic predictor because of its 
importance involved in the underlying pathogenic mechanisms for 
breast cancer. We also used the Cox model to select significant 
predictor for ER+ and ER-. In surprise, the miR-146b was also 
found to be the best predictor (p = 0.038). 

Predicting the function microRNAs in mG^ 

We used the Target-Function-Expression Module of miRGen to 
predict the targets of each microRNA in mG^and obtained 
miRNA-mRNA pairs. In addition, The corresponding mRNA 
expression profiling of 71 samples and 19,061 mRNAs (http:// 
www.ebi.ac.uk/ arrayexpress/ experiments/E-UCON- 1 /) was 
downloaded and 5228 pre-proceeded mRNAs [8] was selected 
after normalization. The Pearson's Correlation coefficient of each 
miRNA-mRNA pair was computed. Then, anti-associa- 
tion(CC< — 0.4) pairs were regarded as reliability. Here, we only 
provided the results of miR-146b targets. The number of predicted 
miRNA-mRNA pairs by miRGen is 1,311, among which 664 
have information of mRNA expression. 76 miRNA-mRNA pairs 
were anti-correlation in expression with CC< — 0.4 in Table S2. In 
these mRNAs, we also mapped all reliable targets to KEGG 
by DAVID database [25], and found that the target genes of 



miR- 1 46b mainly take part in MAPK signaling pathway, Toll-like 
receptor signaling pathway and Pathway in cancer. The targets of 
other miRNAs in mG 4 , such as miR-146a, miR-221 and miR-222, 
were also found to be involved in Toll-like receptor signaling 
pathway, indicating that this pathway may be associated with the 
subtyping of breast cancer. 

Discussion and Conclusions 

Increasing evidence have suggested that miRNAs are involved 
in cancer development through regulating distinct biological 
processes, including cellular growth and proliferation, cellular 
movement and migration, extra cellular matrix degradation. 
Though the expression level of miRNAs is generally low in 
cancer, their unique profiles may have significant clinical 
outcomes, especially for the phenotypes of cancer. Recently, 
many studies have shown that the abnormal expression of 
miRNAs is correlated with human breast cancer. In this study 
we showed that the two-way clustering algorithm could result in an 
improved prognostic accuracy over the breast cancer patients' 
survival profiles. 
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Figure 2. The five partitions of breast cancer were identified using ;»6 4 as the disease feature set in the breast cancer dataset. In the 

figure, each microRNA corresponds to a row, and each breast cancer sample corresponds to column. The 71 breast cancer samples were divided into 
five subtypes (Subtype 1, Subtype 2, Subtype 3, Subtype 4 and Subtype 5). Red areas indicate increased expression, and green areas decreased 
expression. Each column represents a single breast cancer sample, and each row represents a single microRNA. The dendrogram at the top shows the 
degree to which each breast cancer subtype is related to the others with respect to microRNA expression. 
doi:1 0.1 371 /journal.pone.0087601. g002 
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Figure 3. Survival curves for five subtypes of the breast cancer patients in the breast cancer dataset. 

doi:10.1371/journal.pone.0087601.g003 



Computational discoveries of the hidden subtypes for a complex 
disease have to be verified by some means, e.g., a functional assay 
using bioinformatics approaches or a clinical validation using 
epidemiological approaches such as survival analysis. In unsuper- 
vised clustering analysis, however, identifying the best subset for 
dissecting clinically heterogeneous disease can be a challenging 
task as no cross-validation can be done internally. The underlying 
assumption for a clustering algorithm is that microRNA with 
similar expression patterns and adjacent chromosome location are 
more likely to have a similar biological function(s). However, a 
clustering algorithm itself does not provide proof of the best 
grouping of microRNA in terms of biological functions. Thus, the 
biological interpretation of the disease clustering results relies 
heavily on the expert knowledge which often may be subjective 
[26] . Therefore, in this study, we designed a functional consistency 
score for evaluating a candidate miRNA subset in terms of 
functional consistency. In terms of the better-characterized 
functionality of subset mGs, and based on the significantly different 
survival results for the patients denned by the newly defined 
subtypes, the applied two-way clustering algorithm has been 
demonstrated to be a feasible and promising toolbox for peeling off 
molecular heterogeneities of complex human diseases. 



Table 2. Multivariate Cox proportional-hazards analysis 
based on the wiG4signature microRNAs relevant to survival 
time. 
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0.039(.002-.930) 



doi:1 0.1 371 /journal.pone.0087601 .t002 



Many methods use all the microRNAs on chips or a large 
number of microRNAs to predict patient survival. Since the vast 
majority of the microRNAs in a given dataset are irrelevant to the 
survival of the studied patients, this may reduce the prediction 
accuracy of the model because of the added noise. Hence, 
McLachlan et al. [27] proposed a mixture model-based approach 
to the clustering of microarray expression data. In this study, we 
applied an integrative approach that combines a SPC-based two- 
way clustering with a functional consensus to identify the 
functionally sounding and the most compact subset of microRNAs 
underlying the phenotypic partitions of patients. 

Application of the proposed approach to breast cancer datasets 
led to identification of microRNA subsets, and further multivariate 
Cox proportional-hazards modeling defined microRNA- 1 46b as 
one significant predictor for the survival of the breast cancer 
patients in the dataset. The individual miRNAs have only limited 
impact on their targets and multiple miRNAs are needed to 
drastically reduce transcription levels of target. We also found the 
targets of serval microRNAs in the optimal miRNA subset 
involoved similar pathways. Overall, our results demonstrated 
that the proposed approach is highly promising for peeling off the 
hidden genetic heterogeneity based on modern omics data, and 
may lead to an improved diagnosis and treatment of cancers. 

Supporting Information 

Table SI The correlation coefficient of among miRNAs 

in 111G4. 

(DOCX) 

Table S2 The anti-association miRNA-mRNA of among 
miR-146b. 

(DOCX) 
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